
Senior Big Data Engineer Resume


Bothell, WA

PROFESSIONAL SUMMARY:

  • Over 8 years of Big Data Hadoop ecosystem experience in ingestion, storage, querying, processing, and analysis of big data.
  • Experience with Apache Hadoop components such as HDFS, MapReduce, Hive, HBase, Pig, Sqoop, Spark, and Flume for big data and big data analytics.
  • Hands-on experience in Python programming for data processing and for handling data integration between on-premises and cloud databases or data warehouses.
  • Involved in writing data transformations and data cleansing using Pig operations, with good experience in retrieving and processing data using Hive.
  • Hands-on experience in installing and configuring Hadoop ecosystem components such as HDFS, MapReduce, YARN, Pig, Hive, HBase, Oozie, Sqoop, Flume, and Kafka.
  • Configured Spark Streaming to receive real-time data from Kafka, store the stream data to HDFS, and process it using Spark and Scala (a minimal sketch follows this summary).
  • Expertise with Python, Scala, and Java in designing, developing, administering, and supporting large-scale distributed systems.
  • Hands-on experience with Amazon EC2, Amazon S3, Amazon RDS, VPC, IAM, Amazon Elastic Load Balancing, Auto Scaling, CloudFront, CloudWatch, SNS, SES, SQS, and other services of the AWS family.
  • Experience in developing MapReduce programs using Apache Hadoop to analyze big data as per requirements.
  • Extensive knowledge of various reporting objects such as facts, attributes, hierarchies, transformations, filters, prompts, calculated fields, sets, groups, and parameters in Tableau.
  • Experience building data pipelines using Azure Data Factory and Azure Databricks, and loading data into Azure Data Lake, Azure SQL Database, and Azure SQL Data Warehouse, controlling and granting database access.
  • Good knowledge of implementing various data processing techniques using Apache HBase to handle and format data as required.
  • Experience in utilizing SAS procedures, macros, and other SAS applications for data extraction using Oracle and Teradata.
  • Experience in designing star schemas and snowflake schemas for data warehouse and ODS architectures.
  • Good knowledge of data marts, OLAP, and dimensional data modeling with the Ralph Kimball methodology (star schema and snowflake modeling for fact and dimension tables) using Analysis Services.
  • Hands-on experience analyzing SAS ETL and implementing data integration in Informatica using XML.
  • Experienced in building automation regression scripts for validation of ETL processes between multiple databases such as Oracle, SQL Server, Hive, and MongoDB using Python.
  • Proficient in SQL across several dialects, including MySQL, PostgreSQL, Redshift, SQL Server, and Oracle.
  • Proficient in data analysis, cleansing, transformation, migration, integration, import, and export through the use of ETL tools such as Informatica.
  • Analyzed data and provided insights with R programming and Python pandas.
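
The Spark Streaming and Kafka bullet above can be illustrated with a short example. The following is a minimal sketch, assuming a PySpark Structured Streaming job with a placeholder broker, topic, and HDFS paths; none of these names come from the projects described below.

```python
# Minimal sketch: read a Kafka topic with PySpark Structured Streaming and land it on HDFS.
# Broker, topic, and paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = (SparkSession.builder
         .appName("kafka-to-hdfs-sketch")
         .getOrCreate())

# Subscribe to a Kafka topic (requires the spark-sql-kafka package on the classpath).
events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   # placeholder broker
          .option("subscribe", "consumer_events")              # placeholder topic
          .option("startingOffsets", "latest")
          .load())

# Kafka delivers key/value as binary; cast the payload to string before storing.
parsed = events.select(col("key").cast("string"),
                       col("value").cast("string"),
                       col("timestamp"))

# Write the stream to HDFS as Parquet with checkpointing for fault tolerance.
query = (parsed.writeStream
         .format("parquet")
         .option("path", "hdfs:///data/streams/consumer_events")              # placeholder path
         .option("checkpointLocation", "hdfs:///checkpoints/consumer_events")  # placeholder path
         .outputMode("append")
         .start())

query.awaitTermination()
```

The older DStream-based Spark Streaming API could equally have been used; the structured API is shown here only for brevity.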

TECHNICAL SKILLS:

Big Data Technologies: Hadoop, HDFS, Hive, Pig, HBase, Sqoop, Flume, Yarn, Spark SQL, Kafka, Presto

Languages: Python, Scala, PL/SQL, SQL, T-SQL, UNIX, Shell Scripting

Cloud Platform: AWS (Amazon Web Services), Microsoft Azure, Snowflake

BI Tools: SSRS, SSAS.

Modeling Tools: IBM InfoSphere, SQL Power Architect, Oracle Designer, Erwin, ER/Studio, Sybase PowerDesigner.

Database Tools: Oracle 12c, MS Access, Microsoft SQL Server, Teradata, PostgreSQL, Netezza

Reporting Tools: Business Objects, Crystal Reports.

Tools & Software: TOAD, MS Office, BTEQ, Teradata SQL Assistant.

ETL Tools: Informatica Power, SAP Business Objects XIR3.1/XIR2, Web Intelligence.

Operating System: Windows, z/OS, Unix, Linux

Other Tools: TOAD, SQL*Plus, SQL*Loader, MS Project, MS Visio, MS Office; have also worked with C++, UNIX, PL/SQL, etc.

PROFESSIONAL EXPERIENCE:

Confidential, Bothell, WA

Senior Big Data Engineer

Responsibilities:

  • Worked with architects, stakeholders, and the business to design the information architecture of a smart data platform for multi-state deployment on a Kubernetes cluster.
  • Handled billions of log lines coming from several clients and analyzed them using big data technologies like Hadoop (HDFS), Apache Kafka, and Apache Storm.
  • Created Hive tables as per requirements, either internal or external, defined with appropriate static and dynamic partitions and bucketing for efficiency (a hedged DDL sketch follows this list).
  • Created data pipelines for ingestion and aggregation events, loading consumer response data from an AWS S3 bucket into Hive external tables in HDFS to serve as feeds for Tableau dashboards.
  • Developed Python scripts and UDFs using both DataFrames/SQL and RDD/MapReduce in Spark for data aggregation and queries, writing data back into RDBMS through Sqoop (a hedged aggregation sketch follows this job's environment line).
  • Developed simple to complex MapReduce streaming jobs in Python, implemented alongside Hive and Pig.
  • Used the streaming tool Kafka to load data onto the Hadoop file system and move the same data to the Cassandra NoSQL database.
  • Experienced with AWS services to manage applications in the cloud and to create or modify instances.
  • Worked on migrating MapReduce programs into Spark transformations using Spark and Scala.
  • Configured Spark Streaming to receive real-time data from Kafka and store the stream data to HDFS.
  • Migrated existing MapReduce programs to Spark using Scala and Python.
  • Implemented Spark SQL to connect to Hive to read data, with distributed processing for high scalability.
  • Implemented partitioning, dynamic partitions, and buckets in Hive for efficient data access.
  • Exported the analyzed data to relational databases using Sqoop for visualization and to generate reports for the BI team using Tableau.
  • Worked in the AWS environment for development and deployment of custom Hadoop applications.
  • Integrated Cassandra as a distributed persistent metadata store to provide metadata resolution for network entities on the network.
  • Involved in implementing and integrating various NoSQL databases such as HBase and Cassandra.
  • Installed and configured the OpenShift platform for managing Docker containers and Kubernetes clusters.
  • Experience in designing and developing POCs in Spark using Scala to compare the performance of Spark with Hive and SQL/Oracle.
  • Created Hive base scripts for analyzing requirements and processing data, designing the cluster to handle huge amounts of data for cross-examining data loaded via Hive and MapReduce jobs.
  • Created S3 buckets and managed their policies, and utilized S3 and Glacier for storage and backup on AWS.
  • Wrote MapReduce code in Python to eliminate certain security issues in the data.
  • Used Pig Latin on the client-side cluster and HiveQL on the server-side cluster.
  • Imported the complete data set from RDBMS to the HDFS cluster using Sqoop.
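
As referenced in the Hive table bullet above, the following is a minimal sketch of an external, partitioned, bucketed Hive table with a dynamic-partition insert issued through PySpark. The database, table, columns, and HDFS location are hypothetical, and depending on the Hive and Spark versions the insert might instead be run directly in Hive.

```python
# Minimal sketch: external Hive table with partitions and bucketing, issued via PySpark.
# Database, table, columns, and HDFS location are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-partitioning-sketch")
         .enableHiveSupport()
         .getOrCreate())

# External table partitioned by event date and bucketed by customer id.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS analytics.consumer_response (
        customer_id BIGINT,
        channel     STRING,
        response    STRING
    )
    PARTITIONED BY (event_date STRING)
    CLUSTERED BY (customer_id) INTO 32 BUCKETS
    STORED AS ORC
    LOCATION 'hdfs:///data/analytics/consumer_response'
""")

# Allow dynamic partitions so each event_date lands in its own partition on insert.
spark.sql("SET hive.exec.dynamic.partition.mode=nonstrict")
spark.sql("""
    INSERT OVERWRITE TABLE analytics.consumer_response PARTITION (event_date)
    SELECT customer_id, channel, response, event_date
    FROM staging.consumer_response_raw
""")
```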

Environment: HDFS, Hive, Spark, Tableau, Yarn, Cloudera, SQL, Terraform, Splunk, RDBMS, Elasticsearch, Kerberos, AWS (EC2, S3, EMR, Redshift, ECS, Glue, VPC, RDS, etc.), Ranger, Git, Kafka, CI/CD (Jenkins), Kubernetes, Confluence, Shell Scripting, Jira.
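
The Spark UDF and aggregation bullet above mentions writing results back to an RDBMS through Sqoop; below is a minimal sketch of the Spark side of such a flow, assuming hypothetical table names, columns, and staging paths. The Sqoop export itself would run separately from the command line.

```python
# Minimal sketch: aggregate a Hive table with a Python UDF in Spark,
# then stage the result on HDFS for a downstream Sqoop export.
# All table names, columns, and paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, count
from pyspark.sql.types import StringType

spark = (SparkSession.builder
         .appName("aggregation-for-sqoop-export-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Simple Python UDF that normalizes a free-text channel field.
@udf(returnType=StringType())
def normalize_channel(channel):
    return (channel or "unknown").strip().lower()

responses = spark.table("analytics.consumer_response")

daily_counts = (responses
                .withColumn("channel_norm", normalize_channel("channel"))
                .groupBy("event_date", "channel_norm")
                .agg(count("*").alias("response_count")))

# Stage the result as delimited text so it can be exported to the RDBMS, e.g.:
#   sqoop export --connect jdbc:... --table daily_counts --export-dir /staging/daily_counts
(daily_counts.write
 .mode("overwrite")
 .option("sep", ",")
 .csv("hdfs:///staging/daily_counts"))
```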

Confidential, Coppell, TX

Big Data Engineer

Responsibilities:

  • Worked with Spark to improve the performance and optimization of existing algorithms in Hadoop using Spark Context, Spark SQL, PySpark, Impala, Tealeaf, pair RDDs, DevOps practices, and Spark on YARN.
  • Performed structural modifications using MapReduce and Hive, and analyzed data using visualization/reporting tools (Tableau).
  • Installed Kafka producers on different servers and scheduled them to produce data every 10 seconds; implemented data quality checks in the ETL tool Talend; developed Apache Spark applications for data processing from various streaming sources; good knowledge of data warehousing.
  • Created Spark vectorized pandas user-defined functions for data manipulation and wrangling on the storage layer that delivers reliability to data lakes (a hedged sketch follows this job's environment line).
  • Took proof-of-concept project ideas from the business, then led, developed, and created production pipelines that deliver business value using Azure Data Factory.
  • Responsible for managing data coming from different sources through Kafka.
  • Worked with PowerShell and UNIX scripts for file transfer, emailing and other file related tasks.
  • Worked with NoSQL databases like HBase in creating HBase tables to load large sets of semi-structured data coming from various sources
  • Created a custom logging framework for ELT pipeline logging using Append Variable activities in Data Factory.
  • Developed various Oracle SQL scripts, PL/SQL packages, procedures, functions, and Java code for data.
  • Involved in creating an HDInsight cluster in the Microsoft Azure Portal; also created Event Hubs and Azure SQL Databases.
  • Strong knowledge of the architecture and components of Tealeaf, and efficient in working with Spark Core and Spark SQL. Designed and developed RDD seeds using Scala and Cascading, and streamed data to Spark Streaming using Kafka.
  • Extracted and updated the data into HDFS using Sqoop import and export.
  • Created Data Factory pipelines that bulk copy multiple tables at once from relational databases to Azure Data Lake Gen2.
  • Created and maintained optimal data pipeline architecture in the Microsoft Azure cloud using Data Factory and Azure Databricks.
  • Migrated data into the RV Data Pipeline using Databricks and Spark SQL.
  • Used Databricks for encrypting data using server-side encryption.
  • Built a real-time pipeline for streaming data using Event Hubs/Microsoft Azure Queue storage and Spark Streaming.
  • Delivered denormalized data for Power BI consumers for modeling and visualization from the produced layer in the data lake.
  • Utilized Ansible playbooks for code pipeline deployment.
  • Configured Spark Streaming to get ongoing information from Kafka and store the stream information to HDFS.
  • Good exposure to MapReduce programming using Java, Pig Latin scripting, distributed applications, and HDFS.
  • Worked on clustered Hadoop for Windows Azure using HDInsight and the Hortonworks Data Platform for Windows.
  • Implemented IoT streaming with Databricks Delta tables and Delta Lake to enable ACID transaction logging (a hedged sketch follows this list).
  • Exposed transformed data in the Azure Databricks Spark platform as Parquet format for efficient data storage.
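
For the Delta Lake bullet above, here is a minimal sketch of streaming events into a Delta table from Databricks. The real pipeline would read from Event Hubs or Kafka; a built-in rate source stands in here, and all paths are placeholders.

```python
# Minimal sketch (Databricks-style): stream events into a Delta table for ACID writes.
# The source below is a placeholder rate stream; in the pipeline described above it
# would be an Event Hubs / Kafka stream. Paths are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.appName("delta-streaming-sketch").getOrCreate()

# Placeholder streaming source standing in for IoT telemetry.
telemetry = (spark.readStream
             .format("rate")                 # built-in test source: emits (timestamp, value)
             .option("rowsPerSecond", 100)
             .load()
             .withColumn("ingest_ts", current_timestamp()))

# Append the stream to a Delta table; Delta's transaction log provides ACID guarantees.
query = (telemetry.writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/datalake/checkpoints/telemetry")  # placeholder
         .outputMode("append")
         .start("/mnt/datalake/delta/telemetry"))                              # placeholder

query.awaitTermination()
```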

Environment: Hadoop, Spark, MapReduce, Kafka, Scala, Java, Azure Data Factory, Data Lake, Databricks, Azure DevOps, PySpark, Agile, Power BI, Python, R, PL/SQL, Oracle 12c, SQL, NoSQL, HBase, Scaled Agile team environment
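
The vectorized pandas UDF bullet in this role is the kind of thing sketched below: a hypothetical pandas_udf that cleans a string column in Arrow batches rather than row by row. The column names and sample data are assumptions.

```python
# Minimal sketch: a vectorized (pandas) UDF in PySpark for batch-wise string wrangling.
# Column names and sample data are hypothetical.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("pandas-udf-sketch").getOrCreate()

@pandas_udf(StringType())
def clean_sku(sku: pd.Series) -> pd.Series:
    # Operates on whole Arrow batches, avoiding per-row Python overhead.
    return sku.str.strip().str.upper().str.replace(r"[^A-Z0-9-]", "", regex=True)

orders = spark.createDataFrame(
    [(" ab-123 ",), ("cd_456",)],
    ["raw_sku"],
)

orders.withColumn("sku", clean_sku("raw_sku")).show()
```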

Confidential, Greenville, SC

Data Engineer

Responsibilities:

  • Worked with HBase and Hive scripts to extract, transform, and load data into HBase and Hive.
  • Implemented a multi-datacenter, multi-rack Cassandra cluster.
  • Successfully tested Kafka ACLs with anonymous users and with different hostnames.
  • Worked on moving all log files generated from various sources to HDFS for further processing.
  • Developed workflows using custom MapReduce, Pig, Hive, and Sqoop.
  • Worked with network and Linux system engineers to define optimal network configurations, server hardware, and operating systems.
  • Worked on extending Hive and Pig core functionality by writing custom UDFs using Java
  • Involved in importing data from MS SQL Server, MySQL and Teradata into HDFS using Sqoop
  • Extensively used the MapReduce component of Hadoop.
  • Responsible for writing Pig scripts to process the data in the integration environment
  • Responsible for setting up HBase and storing data in HBase.
  • Wrote MapReduce code to process and parse data from various sources, storing the parsed data in HBase and Hive using HBase-Hive integration. Worked on YUM configuration and package installation through YUM.
  • Analyzed data using Hadoop components Hive and Pig.
  • Good experience in writing Spark applications using Python and Scala.
  • Performed dimensional data modelling using Erwin to support data warehouse design and ETL development.
  • Designed and worked with Cassandra Query Language, with knowledge of Cassandra read and write paths and internal architecture (a hedged client sketch follows this list).
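
For the Cassandra and CQL bullet above, the following is a minimal sketch of creating a multi-datacenter keyspace and reading/writing through the DataStax Python driver. Contact points, keyspace, table, and replication settings are placeholders, not values from the actual cluster.

```python
# Minimal sketch: basic CQL read/write with the DataStax Python driver (cassandra-driver).
# Contact points, keyspace, table, and replication settings are hypothetical placeholders.
from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.11", "10.0.0.12"])   # placeholder contact points
session = cluster.connect()

# Multi-datacenter replication, matching a multi-DC cluster topology.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS telemetry
    WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3}
""")
session.set_keyspace("telemetry")

session.execute("""
    CREATE TABLE IF NOT EXISTS sensor_readings (
        sensor_id  text,
        reading_ts timestamp,
        value      double,
        PRIMARY KEY (sensor_id, reading_ts)
    )
""")

# Writes go to the commit log and memtable; reads merge memtables and SSTables.
insert = session.prepare(
    "INSERT INTO sensor_readings (sensor_id, reading_ts, value) VALUES (?, ?, ?)"
)
session.execute(insert, ("sensor-42", datetime.utcnow(), 21.7))

rows = session.execute("SELECT * FROM sensor_readings WHERE sensor_id = 'sensor-42'")
for row in rows:
    print(row.sensor_id, row.reading_ts, row.value)

cluster.shutdown()
```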

Environment: Hadoop, HDFS, Map Reduce, Kafka, Python, Hive, Cassandra, Ansible, AWS, Git.

Confidential

Data Engineer

Responsibilities:

  • Used Hive to implement data warehouse and stored data into HDFS. Stored data into Hadoop clusters which are set up in AWS EMR.
  • Experienced in loading and transforming of large sets of structured, semi-structured and unstructured data.
  • Developed Spark jobs and Hive Jobs to summarize and transform data.
  • Involved in converting Hive/SQL queries into Spark transformations using Spark data frames, Scala and Python.
  • Performed data preparation using Pig Latin to get the data into the required format.
  • Processed image data through the Hadoop distributed system using Map and Reduce, then stored it in HDFS.
  • Worked on a Python script to extract data from Natuzzi databases and transfer it to AWS S3.
  • Exported result sets from Hive to MySQL using the Sqoop export tool for further processing.
  • Collected and aggregated large amounts of log data, staging the data in HDFS for further analysis.
  • Utilized Matplotlib to graph the manipulated data frames for further analysis.
  • Worked on development of data ingestion pipelines using the ETL tool Talend and bash scripting, with big data technologies including but not limited to Hive, Impala, Spark, and Kafka.
  • Experience in developing scalable & secure data pipelines for large datasets.
  • Wrote Python code to manipulate and organize data frames so that all attributes in each field were formatted identically (a hedged sketch follows this list).
  • Experience in data transformations using MapReduce and Hive for different file formats.
  • Involved in converting Hive/SQL queries into transformations using Python.
  • Selected and generated data into CSV files, stored them in AWS S3 using AWS EC2, and then structured and stored the data in AWS Redshift.
  • Developed UDFs in Python to create custom transformations and aggregations for datasets.
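
As referenced in the data-frame formatting bullet above, below is a minimal sketch of normalizing column formats with pandas and shipping the resulting CSV to S3 with boto3, from where it could be loaded into Redshift. The file names, bucket, and columns are assumptions.

```python
# Minimal sketch: normalize data-frame formatting with pandas, then upload the CSV to S3.
# Bucket, key, file names, and columns are hypothetical placeholders.
import pandas as pd
import boto3

df = pd.read_csv("extract.csv")                       # placeholder extract file

# Format every attribute consistently: trimmed, lower-cased column names,
# stripped string fields, and dates normalized to a single format.
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].str.strip()
df["order_date"] = pd.to_datetime(df["order_date"]).dt.strftime("%Y-%m-%d")

df.to_csv("cleaned.csv", index=False)

# Ship the cleaned file to S3, from where it can be loaded into Redshift (e.g. via COPY).
s3 = boto3.client("s3")
s3.upload_file("cleaned.csv", "my-data-bucket", "cleaned/orders/cleaned.csv")
```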

Environment: Hadoop, MapReduce, AWS, HDFS, Pig, HiveQL, MySQL, UNIX Shell Scripting, Tableau, Java, Spark

Confidential

Hadoop Developer

Responsibilities:

  • Involved in story-driven Agile development methodology and actively participated in daily scrum meetings.
  • Created Hive, Phoenix, and HBase tables, as well as HBase-integrated Hive tables, per the design using the ORC file format and Snappy compression.
  • Insert-overwrote the Hive data with HBase data daily to keep the data fresh, and used Sqoop to load data from DB2 into the HBase environment.
  • Implement automation, traceability, and transparency for every step of the process to build trust in data and streamline data science efforts using Python, Java, Hadoop streaming, Apache Spark, SparkSQL, Scala, Hive, and Pig.
  • Created Sqoop jobs to import data from SQL, Oracle, and Teradata to HDFS.
  • Designed a highly efficient data model for optimizing large-scale queries, utilizing Hive complex data types and the Parquet file format (a hedged DDL sketch follows this list).
  • Extensively used the DB2 database to support SQL.
  • Performed data transformations like filtering, sorting, and aggregation using Pig
  • Created Hive tables to push the data to MongoDB.
  • Automated workflows using shell scripts and Control-M jobs to pull data from various databases into Hadoop Data Lake.
  • Loaded data from different data sources (Teradata and DB2) into HDFS using Sqoop and loaded it into partitioned Hive tables.
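
For the Hive complex data types and Parquet bullet above, the following is a minimal sketch of such a table definition and a query over its nested fields, issued through PySpark. The database, table, and columns are hypothetical.

```python
# Minimal sketch: Hive table using complex types (struct/array/map) stored as Parquet,
# issued through PySpark. Database, table, and columns are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-complex-types-sketch")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("""
    CREATE TABLE IF NOT EXISTS warehouse.customer_orders (
        customer_id   BIGINT,
        address       STRUCT<street: STRING, city: STRING, zip: STRING>,
        order_ids     ARRAY<BIGINT>,
        attributes    MAP<STRING, STRING>
    )
    PARTITIONED BY (order_month STRING)
    STORED AS PARQUET
""")

# Querying nested fields directly keeps large-scale scans narrow and efficient.
spark.sql("""
    SELECT customer_id, address.city, size(order_ids) AS order_count
    FROM warehouse.customer_orders
    WHERE order_month = '2017-06'
""").show()
```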

Environment: Hadoop, HDFS, Hive, Pig, DB2, Java, Python, Oracle 9i, SQL, Splunk, Unix, Shell Scripting.
