
Senior Big Data Engineer Resume


Costa Mesa, CA

SUMMARY

  • Over 8 years of IT experience in analysis, design, development, implementation, maintenance, and support, with experience in developing strategic methods for deploying big data technologies to efficiently solve Big Data processing requirements.
  • Strong understanding of Waterfall and Agile - SCRUM methodologies.
  • Working experience with Apache Hadoop ecosystem components like MapReduce, HDFS, Hive, Impala, Sqoop, Pig, Oozie, Zookeeper, Kafka, and Apache Spark.
  • Experienced in applying MapReduce design patterns to solve complex MapReduce problems.
  • Experience in job workflow scheduling and monitoring tools like Oozie, and good knowledge of Zookeeper for coordinating the servers in clusters and maintaining data consistency.
  • Hands-on use of Spark and Scala APIs to compare the performance of Spark with Hive and SQL, and of Spark SQL to manipulate DataFrames in Scala.
  • Worked on real-time streaming and performed transformations on the data using Kafka and Spark Streaming (see the sketch following this list).
  • Experience working with data modeling tools like Erwin and ER/Studio.
  • Good experience working with various data analytics and big data services in the AWS Cloud, like EMR, Redshift, S3, Athena, Glue, etc.
  • Assisted in upgrading, configuring, and maintaining various Hadoop infrastructure components like Pig, Hive, and HBase.
  • Developed a Sqoop framework to ingest historical and incremental data from Oracle, DB2, SQL Server, etc.
  • Hands-on experience in developing end-to-end Spark applications using Spark APIs like RDDs, Spark DataFrames, Spark MLlib, Spark Streaming, and Spark SQL.
  • Experienced in building a Data Warehouse on the Azure platform using Azure Databricks and Azure Data Factory.
  • Performed data validation and transformation using Python and Hadoop streaming.
  • Documented the requirements including the available code which should be implemented using Spark, Hive, HDFS, HBase and Elastic Search.
  • Developed Oozie workflow schedulers to run multiple Hive and Pig jobs that run independently based on time and data availability.
  • Provided concurrent access to Hive tables with shared and exclusive locking by using Zookeeper in the cluster.
  • Experience in developing MapReduce programs using Apache Hadoop for analyzing big data as per the requirement.
  • Used the Oozie workflow engine to manage independent Hadoop jobs and to automate several types of Hadoop jobs, such as Java MapReduce, Hive, and Sqoop, as well as system-specific jobs.
  • Implemented sentiment analysis and text analytics on Twitter social media feeds and market news using Scala and Python.
  • Implemented OLAP multi-dimensional cube functionality using Azure SQL Data Warehouse.
  • Experience developing Kafka producers and consumers for streaming millions of events per second.
  • Excellent experience in analyzing data using HiveQL, Pig Latin, and custom MapReduce programs in Java.
  • Performed data profiling and transformation on the raw data using Pig, Python, and Java
  • Worked with NoSQL databases like HBase, Cassandra, DynamoDB (AWS), and MongoDB.
  • Experience in importing and exporting data with Sqoop between HDFS and relational database systems, and loading it into partitioned Hive tables.
  • Worked on syncing Oracle RDBMS to Hadoop (HBase) while retaining Oracle as the main data store.
  • Configured Zookeeper to coordinate and support Kafka, Spark, Spark Streaming, HBase, and HDFS.
  • Sustained the BigQuery, PySpark, and Hive code by fixing bugs and providing the enhancements required by business users.
  • Expertise in working with Linux/Unix and shell commands on the terminal.
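
Below is a minimal sketch of the Kafka-to-HDFS streaming pattern referenced above, written in PySpark Structured Streaming. The broker address, topic name, and HDFS paths are hypothetical placeholders, and the spark-sql-kafka connector package is assumed to be available on the cluster.

```python
# Minimal PySpark Structured Streaming sketch: consume events from Kafka and
# persist them to HDFS as Parquet. Broker, topic, and paths are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("kafka-to-hdfs-sketch")
         .getOrCreate())

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events_topic")
          .option("startingOffsets", "latest")
          .load())

# Kafka delivers key/value as binary; cast the payload to string before parsing.
parsed = events.select(col("key").cast("string"), col("value").cast("string"))

query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streams/events")
         .option("checkpointLocation", "hdfs:///checkpoints/events")
         .outputMode("append")
         .start())

query.awaitTermination()
```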

TECHNICAL SKILLS

Languages: Python, R, SQL, PL/SQL, Java

Data Visualization: AWS QuickSight, Power BI, Tableau, Informatica, Spotfire, Cognos, Microsoft Excel, PowerPoint

Hadoop Eco System: Hadoop, MapReduce, Spark, HDFS, Sqoop, YARN, Oozie, Hive, Impala, Apache Flume, Apache Storm, Apache Airflow, HBase

Data Analysis: Web Scraping, Data Visualization, Statistical Analysis, Data Mining, Data Warehousing, Data Migration, Database Management

Database: MySQL, SQL Server, Oracle, AWS Redshift

Data Modeling Tools: Erwin Data Modeler, Erwin Model Manager, ER Studio v17, and Power Designer 16.6

Cloud Platform: AWS, Azure, Cloud Stack/Open Stack

Cloud Management: Amazon Web Services (AWS), Amazon Redshift

Testing and Defect Tracking Tools: HP/Mercury Quality Center, WinRunner, MS Visio & Visual SourceSafe

Operating System: Windows, Unix, Sun Solaris

ETL/Data warehouse Tools: Informatica 9.6/9.1, SAP Business Objects XIR3.1/XIR2, Talend, Tableau, and Pentaho.

OLAP Tools: Tableau

PROFESSIONAL EXPERIENCE

Confidential, Costa Mesa, CA

Senior Big Data Engineer

Responsibilities:

  • Designing the business requirement collection approach based on the project scope and SDLC methodology.
  • Developed and executed custom MapReduce programs, Pig Latin scripts, and HQL queries.
  • Implemented sentiment analysis and text analytics on Twitter social media feeds and market news using Scala and Python.
  • Used HBase/Phoenix to support front-end applications that retrieve data using row keys.
  • Implemented a log producer in Scala that watches application logs, transforms incremental logs, and sends them to a Kafka- and Zookeeper-based log collection platform.
  • Extracted files from Hadoop and dropped them into S3 on a daily and hourly basis.
  • Authored Python (PySpark) scripts with custom UDFs for row/column manipulations, merges, aggregations, stacking, data labeling, and all cleaning and conforming tasks (see the UDF sketch following this list).
  • Wrote Pig scripts to generate MapReduce jobs and performed ETL procedures on the data in HDFS.
  • Designed Oozie workflows for job scheduling and batch processing.
  • Created MapReduce programs using the Java API that filter unnecessary records and find unique records based on different criteria.
  • Developed Spark scripts by using Scala Shell commands as per the requirement.
  • Involved in creating HiveQL on HBase tables and importing efficient work order data into Hive tables
  • Worked with two different datasets, one using HiveQL and the other using Pig Latin.
  • Involved in designing and deployment of a Hadoop cluster and different Big Data analytic tools including Pig, Hive, HBase, Oozie, Zookeeper, Sqoop, Flume, Spark, Impala, and Cassandra with the Hortonworks Distribution.
  • Worked with the Oozie workflow engine to run multiple MapReduce, Hive, and Pig jobs.
  • Extensive Experience on importing and exporting data using Flume and Kafka.
  • Good knowledge of and experience in Amazon Web Services (AWS) concepts like EMR and EC2; successfully loaded files to HDFS from Oracle, SQL Server, Teradata, and Netezza using Sqoop.
  • Developed MapReduce/Spark Python modules for machine learning and predictive analytics in Hadoop on AWS.
  • Experience in designing and developing applications in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
  • Ingested data into this application using Hadoop technologies like Pig and Hive.
  • Used Zookeeper to provide coordination services to the cluster. Experienced in managing and reviewing Hadoop log files.
  • Designed and implemented Sqoop for the incremental job to read data from DB2 and load it into Hive tables, and connected Tableau to HiveServer2 for generating interactive reports.
  • Used Sqoop to channel data between HDFS and RDBMS sources.
  • Developed Spark applications using PySpark and Spark SQL for data extraction, transformation, and aggregation from multiple file formats.
  • Used SSIS to build automated multi-dimensional cubes.
  • Responsible for developing a data pipeline with Amazon AWS to extract data from weblogs and store it in HDFS; worked extensively with Sqoop for importing metadata from Oracle.
  • Experience in configuring Zookeeper to coordinate the servers in clusters and to maintain the data consistency that is important for decision making in the process.
  • Created a Kafka producer to connect to different external sources and bring the data to a Kafka broker.
  • Installed and configured Hadoop MapReduce and HDFS, and developed multiple MapReduce jobs in Java for data cleansing and pre-processing.
  • Worked on JSON script generation and wrote UNIX shell scripts to call Sqoop import/export.
  • Built and configured Apache Tez on Hive and Pig to achieve better response times while running MR jobs.
  • Used Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS using Python and NoSQL databases such as HBase and Cassandra.
  • Collected data using Spark Streaming from an AWS S3 bucket in near real time, performed the necessary transformations and aggregations on the fly to build the common learner data model, and persisted the data in HDFS.
  • Worked on Dimensional and Relational Data Modeling using Star and Snowflake Schemas, OLTP/OLAP system, Conceptual, Logical and Physical data modeling using Erwin.
  • Automated the data processing wif Oozie to automate data loading into the Hadoop Distributed File System.
  • Created a Kafka broker in structured streaming to get structured data by schema.
  • Started working with AWS for storage and handling of terabytes of data for customer BI reporting tools.
  • Worked with relational SQL and NoSQL databases, including Oracle, Hive, and HBase, using Sqoop for data movement.
  • Developed automated regression scripts for validation of ETL processes between multiple databases like AWS Redshift, Oracle, MongoDB, T-SQL, and SQL Server using Python.
  • Wrote UNIX shell scripts to automate the jobs and scheduled cron jobs for job automation using Crontab. Created a Lambda deployment function and configured it to receive events from S3 buckets.
  • Built machine learning models including SVM, random forest, and XGBoost to score and identify potential new business cases with Python scikit-learn.
  • Experience in converting existing AWS infrastructure to serverless architecture (AWS Lambda, Kinesis), deploying via Terraform and AWS CloudFormation templates.
  • Worked on Docker container snapshots, attaching to a running container, removing images, managing directory structures, and managing containers.
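
As referenced in the UDF bullet above, here is an illustrative PySpark UDF for a simple cleaning/conforming task; the column and DataFrame names are invented for the example, not taken from the actual project.

```python
# Illustrative PySpark UDF: trim and upper-case a free-text column and label
# null/empty values so downstream joins stay clean. Names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-cleaning-sketch").getOrCreate()

@udf(returnType=StringType())
def conform_code(raw):
    # Treat missing or blank values as UNKNOWN.
    if raw is None or raw.strip() == "":
        return "UNKNOWN"
    return raw.strip().upper()

df = spark.createDataFrame([(" ca ",), (None,), ("ny",)], ["state_code"])
cleaned = df.withColumn("state_code_clean", conform_code("state_code"))
cleaned.show()
```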

Environment: HDFS, Hive, Scala, Sqoop, Spark, Tableau, YARN, Cloudera, SQL, Terraform, Splunk, RDBMS, Elasticsearch, Kerberos, Jira, Confluence, Shell/Perl Scripting, Zookeeper, AWS (EC2, S3, EMR, Redshift, ECS, Glue, VPC, RDS, etc.), Ranger, Git, Kafka, OpenShift, CI/CD (Jenkins), Kubernetes

Confidential, Grapevine, TX

Big Data Engineer

Responsibilities:

  • Analyzed issues and performed impact analysis for them.
  • Ingested data with Sqoop and Flume from an Oracle database.
  • Work on implementing various stages of Data Flow in the Hadoop ecosystem - Ingestion, Processing, Consumption
  • Wrote PySpark and Spark SQL transformations in Azure Databricks to perform complex transformations for business rule implementation (see the sketch following this list).
  • Designed and developed Hive and HBase data structures and Oozie workflows.
  • Extended Hive and Pig core functionality by writing custom UDFs, UDTFs, and UDAFs.
  • Experience in configuring Zookeeper to coordinate the servers in clusters and to maintain the data consistency that is important for decision making in the process.
  • Build and maintain the environment on Azure IAAS, PAAS.
  • Handled importing of data from various data sources, performed transformations using Hive, Map Reduce, and loaded data into HDFS.
  • Developed ETL jobs using Spark-Scala to migrate data from Oracle to new Hive tables.
  • Experienced in working with different scripting technologies like Python and Unix shell scripts.
  • Developed simple/complex Map Reduce jobs using Hive and Pig.
  • Designed workflows & coordinators for the task management and scheduling using Oozie to orchestrate the jobs
  • Created self-service reporting in Azure Data Lake Store Gen2 using an ELT approach.
  • Enabled monitoring and Azure Log Analytics to alert the support team on usage and stats of the daily runs.
  • Responsible for wide-ranging data ingestion using Sqoop and HDFS commands. Accumulated partitioned data in various storage formats like text, JSON, Parquet, etc. Involved in loading data from the Linux file system to HDFS.
  • Experience in Agile methodologies, Scrum stories, and sprints in a Python-based environment, along with data analytics and data wrangling.
  • Performance-tuned Phoenix/HBase, Hive queries, and Spark.
  • Installed Kafka to gather data from disperse sources and store for consumption.
  • The custom File System plugin allows Hadoop Map Reduce programs, HBase, Pig and Hive to work unmodified and access files directly.
  • Developed Spark code using Scala and Spark-SQL/Streaming for faster testing and processing of data.
  • Developed Java Spring-based middleware services to fetch data from HBase using the Phoenix SQL layer for various web UI use cases.
  • Implemented continuous integration/continuous delivery best practices using Azure DevOps, ensuring code versioning.
  • Installed and configured Hadoop ecosystem components.
  • Developed a Spark application that uses Kafka consumer and broker libraries to connect to Apache Kafka, consume data from the topics, and ingest them into Cassandra.
  • Designed and implemented Map Reduce-based large-scale parallel relation-learning system
  • Acquired data from REST APIs and JSON; wrangled data with IPython and Unix tools; segmented and organized data from disparate sources and loaded data into Google BigQuery.
  • Wrote Pig UDFs for converting date and timestamp formats from unstructured files into the required formats and processed them.
  • Decommissioned and added nodes in the clusters for maintenance.
  • Monitored cluster health by setting up alerts using Nagios and Ganglia.
  • Worked with subject matter experts and the project team to identify, define, collate, document, and communicate the data migration requirements.
  • Developed applications involving Big Data technologies such as Hadoop, Spark, Map Reduce, Yarn, Hive, Pig, Kafka, Oozie, Sqoop, and Hue.
  • Built custom Tableau / SAP BusinessObjects dashboards for Salesforce that accept parameters from Salesforce to show the relevant data for the selected object.
  • Hands-on Ab Initio ETL, data mapping, transformation, and loading in a complex, high-volume environment.
  • Imported/exported data into HDFS and Hive using Sqoop and Kafka.
  • Experienced in developing MapReduce programs using Apache Hadoop for working with Big Data.
  • Validated Sqoop jobs and shell scripts, and performed data validation to check that data is loaded correctly without any discrepancy. Performed migration and testing of static data and transaction data from one core system to another.
  • Applied Apache Kafka to transform live streaming with batch processing to generate reports.
  • Extensively used open-source languages Perl, Python, Scala, and Java.
  • Wrote scripts for creating, truncating, dropping, and altering HBase tables to store data after the execution of MapReduce jobs and to use that data for later analytics.
  • Developed best practices, processes, and standards for effectively carrying out data migration activities. Worked across multiple functional projects to understand data usage and implications for data migration.
  • Prepare data migration plans including migration risk, milestones, quality and business sign-off details.
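
A hedged sketch of the kind of business-rule transformation described above, expressed in PySpark and Spark SQL; the table, columns, and rule thresholds are invented for illustration rather than taken from the actual Databricks jobs.

```python
# Business-rule transformation sketch: tier orders by amount and flag rows for
# review, then aggregate with Spark SQL. All names and thresholds are examples.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when

spark = SparkSession.builder.appName("business-rule-sketch").getOrCreate()

orders = spark.createDataFrame(
    [(1, 120.0, "US"), (2, 35.5, "CA"), (3, 980.0, "US")],
    ["order_id", "amount", "country"],
)

enriched = (orders
            .withColumn("tier",
                        when(col("amount") >= 500, "HIGH")
                        .when(col("amount") >= 100, "MEDIUM")
                        .otherwise("LOW"))
            .withColumn("needs_review", col("country") != "US"))

# The same rule can be queried in Spark SQL against a temp view.
enriched.createOrReplaceTempView("orders_enriched")
spark.sql("SELECT tier, COUNT(*) AS cnt FROM orders_enriched GROUP BY tier").show()
```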

Environment: Sqoop, Hive, Azure, Azure Databricks, Azure Data Lake, JSON, XML, Kafka, Python, MapReduce, Oracle, Agile Scrum, Pig, Spark, Scala, DAX.

Confidential, Bellevue, WA

Big Data Engineer

Responsibilities:

  • Working as a Data Engineer utilizing Big data & Hadoop Ecosystems components for building highly scalable data pipelines.
  • Worked in Agile development environment and participated in daily scrum and other design related meetings.
  • Updated Python scripts to match training data with our database stored in AWS CloudSearch, so that we would be able to assign each document a response label for further classification.
  • Experienced in developing MapReduce programs using Apache Hadoop for working with Big Data.
  • Worked on JSON script generation and wrote UNIX shell scripts to call Sqoop import/export.
  • Have good knowledge of NoSQL databases like HBase, Cassandra, and MongoDB.
  • Analyzed the data by performing Hive queries and running Pig scripts to understand user behavior such as shopping enthusiasts, travelers, music lovers, etc.
  • Analyzed customer, behavior, symptom, transaction, and campaign data to identify trends and patterns using different visualization techniques like the Seaborn library in Python.
  • Application development using Hadoop Ecosystems such as Spark, Kafka, HDFS, HIVE, Oozie and Sqoop.
  • Involved in support for Amazon AWS and RDS to host static/media files and the database in the Amazon Cloud.
  • Developed simple/complex Map Reduce jobs using Hive and Pig.
  • Implemented data access jobs through Pig, Hive, Tez, Solr, Accumulo, HBase, and Storm.
  • Installed and configured Hive, Pig, Sqoop, Flume, and Oozie on the Hadoop cluster.
  • Wrote a script in Python to predict the number of people affected by certain diseases by collecting predicted (symptom) data from all medical sectors, evaluating it against outcome data, and raising awareness using machine learning modules like logistic regression (see the sketch following this list).
  • Experienced with performing CRUD operations in HBase.
  • Designed and developed many Real time applications in Talend wif Spark and Kafka.
  • Used the AWS SDK to connect to Amazon S3 buckets, which serve as the object storage service to store and retrieve the media files related to the application.
  • Experience in Hadoop Streaming and writing MR jobs using Perl and Python in addition to Java.
  • Involved in scheduling Oozie jobs
  • Involved in converting Hive/SQL queries into Spark transformations using Scala.
  • Worked on AWS Data Pipeline to configure data loads from S3 into Redshift.
  • Using AWS Redshift, extracted, transformed, and loaded data from various heterogeneous data sources and destinations.
  • Created Tables, Stored Procedures, and extracted data using T-SQL for business users whenever required.
  • Proficient in developing real-time pipelines using Kafka Connect, Kafka Streams, StreamSets, and other real-time processing components.
  • Performed data analysis and design, and created and maintained large, complex logical and physical data models and metadata repositories using Erwin and MB MDR.
  • Wrote shell scripts to trigger DataStage jobs.
  • Assist service developers in finding relevant content in the existing reference models.
  • Worked with sources like Access, Excel, CSV, Oracle, and flat files using connectors, tasks, and transformations provided by AWS Data Pipeline.
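
A minimal scikit-learn logistic regression sketch along the lines of the disease-prediction script mentioned above; the features and labels are synthetic stand-ins, not real clinical data.

```python
# Logistic regression sketch with scikit-learn on synthetic data; the feature
# matrix stands in for symptom data and the labels for outcomes.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))                   # stand-in for symptom features
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # stand-in for outcome labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```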

Environment: Hadoop, HDFS, Spark, Kafka, Python, Pig, AWS, Sqoop, Hive, HBase, Oozie.

Confidential

Data Engineer

Responsibilities:

  • Developed batch scripts to fetch the data from AWS S3 storage and do required transformations in Scala using Spark framework.
  • Used Spark Streaming APIs to perform transformations and actions on the fly for building a common learner data model, which gets the data from Kafka in near real time and persists it to Redshift clusters.
  • Optimized Hive queries using best practices and the right parameters, and used technologies like Hadoop, YARN, Python, and PySpark.
  • Analyzed the SQL scripts and redesigned them using PySpark SQL for faster performance.
  • Created Sqoop Scripts to import and export customer profile data from RDBMS to S3 buckets.
  • Troubleshooting Spark applications for improved error tolerance and reliability.
  • Used the Spark DataFrame and Spark APIs to implement batch processing of jobs.
  • Used Apache Kafka and Spark Streaming to get data from the Adobe live-stream REST API connections.
  • Automated creation and termination of AWS EMR clusters (see the sketch following this list).
  • Implemented Python libraries such as NumPy, Matplotlib, Pandas, and scikit-learn, and used them to create dashboards and visualizations in IDEs such as Spyder and Jupyter Notebook.
  • Used various Spark concepts like broadcast variables, caching, and dynamic allocation to design more scalable Spark applications.
  • Implemented continuous integration and deployment using CI/CD tools like Jenkins, GIT, Maven.
  • Installed and Configured Jenkins Plugins to support the project specific tasks.
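
A rough boto3 sketch of automating EMR cluster creation and termination, as referenced above; the instance types, counts, release label, and log bucket are placeholder values, not the actual project configuration.

```python
# Launch a transient EMR cluster with boto3, then tear it down explicitly.
# All configuration values here are hypothetical placeholders.
import boto3

emr = boto3.client("emr", region_name="us-west-2")

response = emr.run_job_flow(
    Name="transient-etl-cluster",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,   # keep alive so teardown is explicit
        "TerminationProtected": False,
    },
    LogUri="s3://my-emr-logs/",                # hypothetical log bucket
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

cluster_id = response["JobFlowId"]
print("launched cluster:", cluster_id)

# Explicit teardown once the work is done.
emr.terminate_job_flows(JobFlowIds=[cluster_id])
```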

Environment: AWS EMR, Spark, Hive, HDFS, Sqoop, Kafka, Oozie, HBase, Scala, MapReduce.

Confidential

Junior Hadoop Developer

Responsibilities:

  • Participated in data acquisition with the Data Engineer team to extract clinical and imaging data from several data sources like flat files and other databases.
  • Performed Data Preparation by using Pig Latin to get the right data format needed.
  • Utilized the clinical data to generate features describing the different illnesses using LDA topic modelling.
  • Used PCA to reduce dimensionality and compute eigenvalues and eigenvectors, and used OpenCV to analyze CT scan images to identify disease in the scans.
  • Processed the image data through the Hadoop distributed system using Map and Reduce, then stored it in HDFS.
  • Used various OLAP operations like slice / dice, drill down and roll up as per business requirements.
  • Built machine learning models to showcase Big Data capabilities using PySpark and MLlib (see the sketch following this list).
  • Enhancing Data Ingestion Framework by creating more robust and secure data pipelines.
  • Worked with multiple storage formats (Avro, Parquet) and databases (Hive, Impala, Kudu).
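
A small PySpark MLlib sketch of the kind of model-building mentioned above: assembling features and fitting a logistic regression on synthetic data; the column names and values are invented for the example.

```python
# PySpark MLlib sketch: assemble feature columns into a vector and fit a
# logistic regression. The data and column names are synthetic.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

df = spark.createDataFrame(
    [(0.0, 1.2, 0.7), (1.0, 3.4, 2.1), (0.0, 0.5, 0.2), (1.0, 2.9, 1.8)],
    ["label", "feature_a", "feature_b"],
)

assembler = VectorAssembler(inputCols=["feature_a", "feature_b"],
                            outputCol="features")
train = assembler.transform(df)

lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=20)
model = lr.fit(train)

model.transform(train).select("label", "prediction").show()
```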

Environment: Hadoop, HDFS, Map Reduce, Pig, Hive, Impala, SQL, OLAP, MS Office, Windows
